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Abstract 

Nonnegative Matrix Factorization (NMF) is the problem of approximating a nonnegative ma- 
trix with the product of two low-rank nonnegative matrices and has been shown to be particularly 
useful in many applications, e.g., in text mining, image processing, computational biology, etc. In 
this paper, we explain how algorithms for NMF can be embedded into the framework of multi- 
level methods in order to accelerate their convergence. This technique can be applied in situations 
where data admit a good approximate representation in a lower dimensional space through linear 
transformations preserving nonnegativity. A simple multilevel strategy is described and is experi- 
mentally shown to speed up significantly three popular NMF algorithms (alternating nonnegative 
least squares, multiplicative updates and hierarchical alternating least squares) on several standard 
image datasets. 

Keywords: nonnegative matrix factorization, algorithms, multigrid and multilevel methods, image 
processing. 



1 Introduction 

Nonnegative Matrix Factorization (NMF) consists in approximating a nonnegative matrix as the 
product of two low-rank nonnegative matrices [25] . More precisely, given an m-by-n nonnegative 
matrix M and a factorization rank r, we would like to find two nonnegative matrices V and W of 
dimensions m-by-r and r-by-n respectively such that 

M « VW. 

This decomposition can be interpreted as follows: denoting by M-j the j th column of M, by V± the 
k th column of V and by Wkj the entry of W located at position (k,j), we want 

r 

M,j « J2w kj V k , l<j<n, 
k=l 

so that each given (nonnegative) vector M-j is approximated by a nonnegative linear combination of r 
nonnegative basis elements V±. Both the basis elements and the coefficients of the linear combinations 
have to be found. Nonnegativity of vectors V± ensures that these basis elements belong to the same 
space as the columns of M and can then be interpreted in the same way. Moreover, the additive 
reconstruction due to nonnegativity of coefficients W k j leads to a part-based representation [26]: basis 
elements V± will tend to represent common parts of the columns of M. For example, let each column 
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of M be a vectorized gray-level image of a face using (nonnegative) pixel intensities. The nonnegative 
matrix factorization of M will generate a matrix V whose columns are nonnegative basis elements of 
the original images, which can then be interpreted as images as well. Moreover, since each original 
face is reconstructed through a weighted sum of these basis elements, the latter will provide common 
parts extracted from the original faces, such as eyes, noses and lips. Figure Q] illustrates this property 
of the NMF decomposition. 
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Figure 1: Illustration of NMF on a face database. Basis elements (matrix V) obtained with NMF 
on the CBCL Face Database #1, MIT Center For Biological and Computation Learning, available at 
http://cbcl.mit.edu/cbcl/software-datasets/FaceData2.html, consisting of 2429 gray-level images of 
faces (columns) with 19 x 19 pixels (rows), for which we set the factorization rank equal to r = 49. 



One of the main challenges of NMF is to design fast and efficient algorithms generating the non- 
negative factors. In fact, on the one hand, practitioners need to compute rapidly good factorizations 
for large-scale problems (e.g., in text mining or image processing); on the other hand, NMF is a 
NP-hard problem [37] and we cannot expect to find a globally optimal solution in a reasonable com- 
putational time. This paper presents a general framework based on a multilevel strategy leading to 
faster initial convergence of NMF algorithms when dealing with data admitting a simple approximate 
low-dimensional representation (using linear transformations preserving nonnegativity) , such as im- 
ages. In fact, in these situations, a hierarchy of lower-dimensional problems can be constructed and 
used to compute efficiently approximate solutions of the original problem. Similar techniques have 
already been used for other dimensionality reduction tasks such as PCA [33] . 

The paper is organized as follows: NMF is first formulated as an optimization problem and three 
well-known algorithms (ANLS, MU and HALS) are briefly presented. We then introduce the concept 
of multigrid/multilevel methods and show how and why it can be used to speed up NMF algorithms. 
Finally, we experimentally demonstrate the usefulness of the proposed technique on several standard 
image databases, and conclude with some remarks on limitations and possible extensions of this 
approach. 
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2 Algorithms for NMF 

NMF is typically formulated as a nonlinear optimization problem with an objective function measuring 
the quality of the low-rank approximation. In this paper, we consider the sum of squared errors: 

min \\M-VW\\ 2 F s.t. V > 0, W > 0, (NMF) 
V G R mxr 

W e W xn 

i.e., use the squared Frobenius norm = J^j j ^ij °f the approximation error. Since this standard 

formulation of (jNMFp is NP-hard |37j . most NMF algorithms focus on finding locally optimal solutions. 
In general, only convergence to stationary points of (jNMFp (points satisfying the necessary first-order 
optimality conditions) is guaranteed. 



2.1 Alternating Nonnegative Least Squares (ANLS) 

Although (jNMFp is a nonconvex problem, it is convex separately in each of the two factors V and 
W, i.e., finding the optimal factor V corresponding to a fixed factor W reduces to a convex optimiza- 
tion problem, and vice-versa. More precisely, this convex problem corresponds to a nonnegative least 
squares (NNLS) problem, i.e., a least squares problem with nonnegativity constraints. The so-called 
alternating nonnegative least squares (ANLS) algorithm for (jNMFp minimizes (exactly) the cost func- 
tion alternatively over factors V and W so that a stationary point of (jNMFp is obtained in the limit 
A frequent strategy to solve the NNLS subproblems is to use active-set methods [25] (see lA"j) 



Algorithm 1 Alternating Nonnegative Least Squares 



Require: Data matrix M E M+ Xn and initial iterate W G 



while stopping criterion not met do 
V <- argmin y > ||M - VW|||; 
W <- argmin^oIlM - VTU|||. 

end while 



for which an efficient implementation^ is described in |36j [22] . We refer the reader to [6] for a survey 
about NNLS methods. 



2.2 Multiplicative Updates (MU) 

In [27] Lee and Seung propose multiplicative updates (MU) for (jNMFp which guarantee nonincreas- 
ingness of the objective function (cf. Algorithm [2]) . They also alternatively update V for W fixed and 
vice versa, using a technique which was originally proposed by Daube-Witherspoon and Muehllehner 
to solve nonnegative least squares problems [13J. The popularity of this algorithm came along with 
the popularity of NMF. Algorithm [2] does not guarantee convergence to a stationary point (although 
it can be slightly modified in order to get this property [291 EB]) and it has been observed to converge 
relatively slowly, see [20| and the references therein. 



2.3 Hierarchical Alternating Least Squares (HALS) 

In ANLS, variables are partitioned at each iteration such that each subproblem is convex. However, the 
resolution of these convex NNLS subproblems is nontrivial and relatively expensive. If we optimize 

1 Available at http://www.cc.gatech.edu/-hpark/ Notice that an improved version based on a principal block 
pivoting method has been released recently, see |23l 124] . and for which our multilevel method is also applicable, see 
Section mi 
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Algorithm 2 Multiplicative Updates 



Require: Data matrix M G xn and initial iterates (V, W) G 



while stopping criterion not met do 

V <_ y o [ M ^ T1 



[v(ww j ')] • 

W-G- Wo 



[V T M] 



l(V T V)W] ■ 
end while 

denotes the Hadamard (component-wise) division. 



instead one single variable at a time, we get a simple univariate quadratic problem which admits 
a closed-form solution. Moreover, since the optimal value of each entry of V (resp. W) does not 
depend of the other entries of the same column (resp. row), one can optimize alternatively whole 
columns of V and whole rows of W. This method was first proposed by Cichocki et al. [10\ [8] and 
independently by [HI [T71 [28], and is herein referred to as Hierarchical Alternating Least Squares 
(HALS), see Algorithm [3l Under some mild assumptions, every limit point is a stationary point of 

Algorithm 3 Hierarchical Alternating Least Squares 

Require: Data M G R™ xn and initial iterates (V, W) G R+ xr x Wf n . 

while stopping criterion not met do 
2: Compute A = MW T and B = WW T . 
3: for k = 1 : r do 

V :k <- max fo, A:fc " EL R M . #fcK ' B ' fc ' 

5 
6 
7 

8 

9 
10 



B, 

end for 

Compute C = V T M and D = V T V . 
for k = 1 : r do 

end for 
end while 



dNMF]) . see [E]. 



3 Multigrid Methods 

In this section, we briefly introduce multigrid methods. The aim is to give the reader some insight on 
these techniques in order to comprehend their applications for NMF. We refer the reader to [31 H[ l5l 135] 
and the references therein for detailed discussions on the subject. 



Multigrid methods were initially used to develop fast numerical solvers for boundary value prob- 
lems. Given a differential equation on a continuous domain with boundary conditions, the aim is to 
find an approximation of a smooth function / satisfying the constraints. In general, the first step is 
to discretize the continuous domain, i.e., choose a set of points (a grid) where the function values will 
be approximated. Then, a numerical method (e.g., finite differences, finite elements) translates the 
continuous problem into a (square) system of linear equations: 

find s.t. Ax = b, with A G R nxn , b G IP, (3.1) 

where the vector x will contain the approximate values of / on the grid points. Linear system (|3.1|) 
can be solved either by direct methods (e.g., Gaussian elimination) or iterative methods (e.g., Jacobi 
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and Gauss-Seidel iterations). Of course, the computational cost of these methods depends on the 
number of points in the grid, which leads to a trade-off between precision (number of points used for 
the discretization) and computational cost. 

Iterative methods update the solution at each step and hopefully converge to a solution of (|3.ip . 
Here comes the utility of multigrid: instead of working on a fine grid during all iterations, the solution 
is restricted to a coarser grid_| on which the iterations are cheaper. Moreover, the smoothness of 
function / allows to recover its low-frequency components faster on coarser grids. Solutions of the 
coarse grid are then prolongated to the finer grid and iterations can continue (higher frequency com- 
ponents of the error are reduced faster). Because the initial guess generated on the coarser grid is a 
good approximation of the final solution, less iterations are needed to converge on the fine (expensive) 
grid. Essentially, multigrid methods make iterative methods more efficient, i.e., accurate solutions are 
obtained faster. 

More recently, these same ideas have been applied to a broader class of problems, e.g., multiscale 
optimization with trust-region methods [19] and multiresolution techniques in image processing [34j . 

4 Multilevel Approach for NMF 

The three algorithms presented in Section [2] (ANLS, MU and HALS) are iteratively trying to find 
a stationary point of (|NMFj) . Indeed, most practical NMF algorithms are iterative methods, such 
as projected gradient methods [30] and Newton- like methods [SJ [H] (see also [TJ [TJ HU EI] and the 
references therein). In order to embed these algorithms in a multilevel strategy, one has to define the 
different levels and describe how variables and data are transferred between them. In this section, we 
first present a general description of the multilevel approach for NMF algorithms, and then apply it 
to image datasets. 

4.1 Description 

Let each column of the matrix M be a element of the dataset (e.g., a vectorized image) belonging to 
WP;. We define the restriction operator TZ as a linear operator 

K : M™ -> R™' : x -> H(x) = Rx, 

with R € xm and m' < m, and the prolongation V as a linear operator 

V : R™' R™ : y -»• V(y) = Py, 

with P € R+ Xm . Nonnegativity of matrices R and P is a sufficient condition to preserve nonnegativity 
of the solutions when they are transferred from one level to another. In fact, in order to generate 
nonnegative solutions, one requires 

K{x) > 0, Vx > and V{y) > 0, My > 0. 

We also define the corresponding transfer operators on matrices, operating columnwise: 

TZ([xi x 2 . ■ ■ x n ]) = \R(xx) TZ(x 2 ) . . . TZ(x n )}, and 

P([yi V2-.. y n \) = [V(yi)V(y 2 ) ■ ■ ■ V(y n )l 

2 Standard multigrid techniques actually restrict the residual instead of the solution, see the discussion in Section T6. 21 
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for Xi € W£, yi € W™', 1 < i < n. 

In order for the multilevel strategy to work, information lost when transferring from one level 
to another must be limited, i.e., data matrix M has to be well represented by 1Z(M) in the lower 
dimensional space, which means that the reconstruction V(1Z(M)) must be close to M. From now on, 
we say that M is smooth with respect to 1Z and V if and only if 

\\M-V(1Z(M))\\ F . 

sm = ttttti is small . 

\\M\\ F 

Quantity % measures how well M can be mapped by 1Z into a lower-dimensional space, then brought 
back by V, and still be a fairly good approximation of itself. 

Based on these definitions, elaborating a multilevel approach for NMF is straightforward: 



1. We are given M e R™ xn and (V , W ) G M™ xr x R r + Xn ; 

2. Compute M' = K{M) = RM G M™' xn and Vft = K{V ) = RV G M+' xr , i.e., restrict the 
elements of your dataset and the basis elements of the current solution to a lower dimensional 
space; 

3. Compute a rank-r NMF (V',W) of M' using (Vq, Wo) as initial matrices, i.e., 

V'W r*M' = K(M). 

This can be done using any NMF iterative algorithm or, even better, using the multilevel strategy 
recursively (cf. Section |43|) . 

4. Since 

M w V{K{M)) = V(M') w V{V'W) = PV'W = V(V')W = VW, 

where V is the prolongation of V, (V, W) is a good initial estimate for a rank-r NMF of M, 
provided that M is smooth with respect to 1Z and V (i.e., sm is small) and that V'W is a good 
approximation of M' = 1Z(M) (i.e., \\M' — V'W\\f is small); in fact, 

\\M -V{V')W\\ F < \\M -V(K(M))\\ F + \\V(K(M)) -V(V'W)\\ F 

< s M \\M\\ F + \\V{TZ{M)-V'W)\\ F 

< s M \\M\\ F + \\P\\ F \\K{M)-V'W\\ F . 

5. Further improve the solution (V, W) using any NMF iterative algorithm. 



Computations needed at step 3 are cheap (since m! < m) and, moreover, the low-frequency com- 
ponents of the erroiH are reduced faster on coarse levels (cf . Section 14. 4p . Therefore this strategy is 
expected to accelerate the convergence of NMF algorithms. 

We now illustrate this technique on image datasets, more precisely, on two-dimensional gray-level 
images. In general, images are composed of several smooth components, i.e., regions where pixel values 
are similar and change continuously with respect to their location (e.g., skin on a face or, the pupil or 
sclera^ of an eye). In other words, a pixel value can often be approximated using the pixel values of its 
neighbors. This observation can be used to define the transfer operators (Section I4.2p . For the com- 
putation of a NMF solution, the multilevel approach can be used recursively; three strategies (called 
multigrid cycles) are described in Section 14.31 Finally, numerical results are reported in Section [SJ 

3 The low-frequency components refer to the parts of the data which are well-represented on coarse levels. 
4 The white part of the eye. 
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4.2 Coarse Grid and Transfer Operators 

A crucial step of multilevel methods is to define the different levels and the transformations (operators) 
between them. Figure [2] is an illustration of a standard coarse grid definition: we note I 1 the matrix 
of dimension (2 a + 1) x {2 h + 1) representing the initial image and the matrix of dimension (2 a ~ l+l + 
1) x (2 h ~ l+1 + 1) representing the image at level I obtained by keeping, in each direction, only one out 
of every two points of the grid at the preceding level, i.e., I . 



level 1 

(fine) 



level 2 

(middle) 



level 3 

(coarse) 




Figure 2: Multigrid Hierarchy. Schematic view of a grid definition for image processing (image from 
ORL face database, cf. Section [5]). 

The transfer operators describe how to transform the images when going from finer to coarser 
levels, and vice versa, i.e., how to compute the values (pixel intensities) of the image I using values 
from image at the finer level (restriction) or from image at the coarser level (prolongation). 
For the restriction, the full-weighting operator is a standard choice: values of the coarse grid points are 
the weighted average of the values of their neighbors on the fine grid (see Figure for an illustration) . 
Noting l\ ■ the intensity of the pixel (i, j) of image it is defined as follows: 



Restriction 



Prolongation 



Figure 3: Restriction and Prolongation. 
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(4.1) 



except on the boundaries of the image (when i = 0, j = 0, i = 2 a ~ l+l and/or j = 2 b ~ l+1 ) where the 
weights are adapted correspondingly. For example, to restrict a 3 x 3 image to a 2 x 2 image, 1Z is 
defined with 

/4202 10000\ 
024012000 
000210420 
\000012024/ 



(3x3 images needing first to be vectorized to vectors in M 9 , by concatenation of either columns or 
rows). 

For the prolongation, we set the values on the fine grid points as the average of the values of their 
neighbors on the coarse grid: 

= mean i / 6rd(i/2) (i^fj , (4.2) 

i'erdCj/2) 

where 

{k/2} k even, 

{(fc — l)/2, (fc + l)/2} k odd. 



rd(fc/2) = 

For example, to prolongate a 2 x 2 image to a 3 x 3 image, V is defined with 



P 1 



/4202 10000\ 
024012000 
000210420 
\000012024/ 



Note that these transformations clearly preserve nonnegativity. 



4.3 Multigrid Cycle 

Now that grids and transfer operators are defined, we need to choose the procedure that is applied 
at each grid level as it moves through the grid hierarchy. In this section, we propose three different 
approaches: nested iteration, V-cycle and full multigrid cycle. 



In our setting, the transfer operators only change the number of rows m of the input matrix M, 
i.e., the number of pixels in the images of the database: the size of the images are approximatively 
four times smaller between each level: m! ~ 4m. When the number of images in the input matrix is 
not too large, i.e., when n <C m, the computational complexity per iteration of the three algorithms 
(ANLS, MU and HALS) is close to being proportional to m (cf. |A]), and the iterations will then be 
approximately four times cheaper (see also Section f6.1|) . A possible way to allocate the time spent at 
each level is to allow the same number of iterations at each level, which seems to give good results in 
practice. Table [T] shows the time spent and the corresponding number of iterations performed at each 
level. 

Note that the transfer operators require 0(mn) operations and, since they are only performed 
once between each level, their computational cost can be neglected (at least for r>l and/or when a 
sizable amount of iterations are performed). 
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Table 1: Number of iterations performed and time spent at each level when allocating among L levels 
a total computational budget T, corresponding to 4k iterations at the finest level. 





Level 1 


Level 2 




Level L — 1 


Level L 


Total 




(finer) 








(coarser) 




~ # iterations 


3A; 


3k 




3A; 


4k 


(3L + l)k 


time 


3 rji 

A 1 


3 rp 

16 ± 




3 rp 
4L-1 1 




T 



4.3.1 Nested Iteration (NI) 

To initialize NMF algorithms, we propose to factorize the image at the coarsest resolution and then 
use the solution as initial guess for the next (finer) resolution. This is referred to as nested iteration, 
see Figure 0] for an illustration with three levels and Algorithm 0] for the implementation. The idea 
is to start off the final iterations at the finer level with a better initial estimate, thus reducing the 
computational time required for the convergence of the iterative methods on the fine grid. The number 
of iterations and time spent at each level is chosen according to Table [H i.e., three quarters of the 
alloted time for iterations at the current level preceded by one quarter of the time for the recursive 
call to the immediately coarser level. 



> 



Iterations 

Figure 4: Nested Iteration. Transition between different levels for nested iteration. 



Algorithm 4 Nested Iteration 

Require: L G N (number of levels), M G M™ xn (data matrix), (V ,W ) G M+ Xr x R^ xn (initial 

matrices) and T > (total time allocated to the algorithm). 
Ensure: (V, W) > s.t. VW w M. 

1: if L = 1 then 

2: [V, W] = NMF algorithm(M, V , W , T) ; 
3: else 

4: M' = K(M);V{ = K(V ); 

5: [V', W] = Nested Iteration(L - 1, M', F ', W Q ,T/4); 
6: V = P(V); 

7: [V, W) = NMF algorithm(M, V, W, 3T/4) ; 
8: end if 



Remark 1. When the ANLS algorithm is used, the prolongation of V does not need to be computed 
since that algorithm only needs an initial value for the W iterate. Note that this can be used in principle 
to avoid computing any prolongation, by setting V directly as the optimal solution of the corresponding 
NNLS problem. 
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4.3.2 V Cycle (VC) 



One can often empirically observe that multilevel methods perform better if a few iterations are 
performed at the fine level immediately before going to coarser levels. This is partially explained by 
the fact that these first few iterations typically lead to a relatively important decrease of the objective 
function, at least compared to subsequent iterations. A simple application of this strategy is referred 
to as V-cycle and is illustrated on Figure [5] with three levels; see Algorithm [5] for the implementation. 
Time allocation is as follows: one quarter of the alloted time is devoted to iterations at the current 
level, followed by one quarter of the time for the recursive call to the immediately coarser level, and 
finally one half of the time again for iterations at the current level (we have therefore three quarters 
of the total time spent for iterations at current level, as for nested iteration). 




Iterations 

Figure 5: V-cycle. Transition between different levels for V-cycle. 



Algorithm 5 V-cycle 

Require: L £ N (number of levels), M G M™ xn (data matrix), (V ,W ) G M+ Xr x R^ xn (initial 

matrices) and T > (total time allocated to the algorithm). 
Ensure: (V, W) > s.t. VW « M. 

1: if L = 1 then 

2: [V, W] = NMF algorithm(M, V , W Q , T) ; 
3: else 

4: [V, W] = NMF algorithm(M, V , W , T/4); 
5: M' = K(M); V = K{V)\ 
6: [V',W] = V-cycle(L-l,M',V',W,T/4); 
7: V = V{V); 

8: [V, W] = NMF algorithm(M, V, W, T/2); 
9: end if 



4.3.3 Full Multigrid (FMG) 

Combining ideas of nested iteration and V-cycle leads to a full multigrid cycle defined recursively as 
follows: at each level, a V-cycle is initialized with the solution obtained at the underlying level using 
a full- multigrid cycle. This is typically the most efficient multigrid strategy |35| . In this case, we 
propose to partition the time as follows (T is the total time): -j for the initialization (call of the full 
multigrid on the underlying level) and W- for the V-cycle at the current level (cf . Algorithm [6]) . 

4.4 Smoothing Properties 

We explained why the multilevel strategy was potentially able to accelerate iterative algorithms for 
NMF: cheaper computations and smoothing of the error on coarse levels. Before giving extensive 
numerical results in Section [SI we illustrate these features of multilevel methods on the ORL face 
database. 
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Algorithm 6 Full Multigrid 



Require: L G N (number of levels), M G M™ xn (data matrix), (V ,W ) G M+ Xr x E!j_ xn (initial 

matrices) and T > (total time allocated to the algorithm). 
Ensure: (V, W) > s.t. VW « M. 

1: if L = 1 then 

2: [V, W] = NMF algorithm(M, V ,W ,T); 
3: else 

4: V' = K(V ); M' = K{M); * 

5: [y', TV] = Full Multigrid(L - 1, M', V, W ,T/4); 

6: V = prolongation(V'); 

7: [y; W] = V-cycle(L, M, V, W, ST/ A); 

8: end if 

*Note that the restrictions of M should be computed only once for each level and saved as global variables so that 
the call of the V-cycle (step 7) does not have to recompute them. 



Comparing three levels, Figure displays the error (after prolongation to the fine level) for two 
faces and for different numbers of iterations (10, 50 and 100) using MU. Comparing the first row and 
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Figure 6: Smoothing on Coarse Levels. Example of the smoothing properties of the multilevel approach 
on the ORL face database. Each image represents the absolute value of the approximation error (black 
tones indicate a high error) of one of two faces from the ORL face database. These approximations 
are the prolongations (to the fine level) of the solutions obtained using the multiplicative updates on 
a single level, with factorization rank r = 40 and the same initial matrices. From top to bottom: level 
1 (fine), level 2 (middle) and level 3 (coarse); from left to right: 10 iterations, 50 iterations and 100 
iterations. 



the last row of Figure (U it is clear that, in this example, the multilevel approach allows a significant 
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smoothing of the error. After only 10 iterations, the error obtained with the prolongated solution of 
the coarse level is already smoother and smaller (see Figure [7J), while it is computed much faster. 

Figure [7] gives the evolution of the error with respect to the number of iterations performed (left) 
and with respect to computational time (right). In this example, the initial convergence on the three 
levels is comparable, while the computational cost is much cheaper on coarse levels. In fact, compared 
to the fine level, the middle (resp. coarse) level is approximately 4 (resp. 16) times cheaper. 



X10 4 X10 4 




Iterations Time(s.) 



Figure 7: Evolution of the error on each level, after prolongation on the fine level, with respect to (left) 
the number of iterations performed and (right) the computational time. Same setting as in Figure El 



5 Computational Results 

To evaluate the performances of our multilevel approach, we present some numerical results for several 
standard image databases described in Table [2J 



Table 2: Image datasets. 



Data 


# pixels 


m 


n 


r 


ORL face 1 


112 x 92 


10304 


400 


40 


Umist face 2 


112 x 92 


10304 


575 


40 


Iris 3 


960 x 1280 


1228800 


8 


4 


Hubble Telescope [38] 


128 x 128 


16384 


100 
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1 http: //www. cl . cam. ac .uk/research/dtg/attarchive/f acedat abase .html 

2 http: //www. cs . toronto . edu/~roweis/data.html 

3 http: //www. bath. ac .uk/elec-eng/research/sipg 

For each database, the three multigrid cycles (NI, V-cycle and FMG) of our multilevel strategy are 
tested using 100 runs initialized with the same random matrices for the three algorithms (ANLS, MU 
and HALS), with a time limit of 10 seconds. All algorithms have been implemented in MATLAB® 
7.1 (R14) and tested on a 3 GHz Intel® Core™2 Dual CPU PC. 
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5.1 Results 

Tables El H] and [5] give the mean error attained within 10 seconds using the different approaches. 



Table 3: Comparison of the mean error on the 100 runs with ANLS. 





#lvl 


ORL 


Umist 


Iris 


Hubble 


NMF 


1 


14960 


26013 


28934 


24.35 


NI 


2 


14683 


25060 


27834 


15.94 




3 


14591 


24887 


27572 


16.93 




4 


14580 


24923 


27453 


17.20 


VC 


2 


14696 


25195 


27957 


16.00 




3 


14610 


24848 


27620 


16.12 




4 


14599 


24962 


27490 


16.10 


FMG 


2 


14683 


25060 


27821 


16.10 




3 


14516 


24672 


27500 


16.56 




4 


14460 


24393 


27359 


16.70 



Table 4: Comparison of the mean error on the 100 runs with MU. 





#lvl 


ORL 


Umist 


Iris 


Hubble 


NMF 


1 


34733 


131087 


64046 


21.68 


NI 


2 


23422 


87966 


37604 


22.80 




3 


20502 


67131 


33114 


18.49 




4 


19507 


59879 


31146 


16.19 


VC 


2 


23490 


90064 


36545 


10.62 




3 


20678 


69208 


32086 


9.77 




4 


19804 


62420 


30415 


9.36 


FMG 


2 


23422 


87966 


37504 


22.91 




3 


19170 


58469 


32120 


15.06 




4 


17635 


46570 


29659 


11.71 



Table 5: Comparison of the mean error on the 100 runs with HALS. 





#lvl 


ORL 


Umist 


Iris 


Hubble 


NMF 


1 


15096 


27544 


31571 


17.97 


NI 


2 


14517 


25153 


29032 


17.37 




3 


14310 


24427 


28131 


16.91 




4 


14280 


24256 


27744 


16.92 


VC 


2 


14523 


25123 


28732 


17.37 




3 


14339 


24459 


28001 


17.02 




4 


14327 


24364 


27670 


17.04 


FMG 


2 


14518 


25153 


29120 


17.39 




3 


14204 


23950 


27933 


16.69 




4 


14107 


23533 


27538 


16.89 



In all cases, the multilevel approach generates much better solutions than the original NMF algorithms, 
indicating that it is able to accelerate their convergence. The full multigrid cycle is, as expected, the 
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best strategy while nested iteration and V-cycle give comparable performances. We also observe that 
the additional speedup of the convergence when the number of levels is increased from 3 to 4 is less 
significant; it has even a slightly negative effect in some cases. In general, the 'optimal' number of lev- 
els will depend on the smoothness and the size of the data, and on the algorithm used (cf. Section fo.ip . 



HALS combined with the the full multigrid cycle is one of the best strategies. Figure [8] displays the 
distribution of the errors for the different databases in this particular case. For the ORL and Umist 
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Figure 8: Distribution of the error among the 100 random initializations using the HALS algorithm 
with a full multigrid cycle: (top left) ORL, (top right) Umist, (bottom left) Iris, and (bottom right) 
Hubble. 



databases, the multilevel strategy is extremely efficient: all the solutions generated with 2 and 3 levels 
are better than the original NMF algorithm. For the Iris and Hubble databases, the difference is not 
as clear. The reason is that the corresponding NMF problems are 'easier' because the factorization 
rank r is smaller. Hence the algorithms converge faster to stationary points, and the distribution of 
the final errors is more concentrated. 

In order to visualize the evolution of the error through the iterations, Figure [9] plots the objective 
function with respect to the number of iterations independently for each algorithm and each database, 
using nested iteration as the multigrid cycle (which is the easiest to represent). In all cases, prolon- 
gations of solutions from the lower levels generate much better solutions than those obtained on the 
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x10 



.x 10 



x10 




10 15 
Iterations 



Iterations 



Figure 9: Evolution of the objective function. From left to right : MU, ANLS and HALS. From top 
to bottom: ORL, Umist, Iris and Hubble databases. 1 level stands for the standard NMF algorithms. 
The initial points for the curves 2 levels and 3 levels are the prolongated solutions obtained on the 
coarser levels using nested iteration, cf. Section 14.31 All algorithms were initialized with the same 
random matrices. 



fine level (as explained in Section 14. 4p . 



These test results are very encouraging: the multilevel approach for NMF seems very efficient when 
dealing with image datasets and allows a significant speedup of the convergence of the algorithms. 
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6 Limitations 



Although the numerical results reported in the previous section demonstrate significant computational 
advantages for our multilevel technique, we point out in this section two limitations that can potentially 
affect our approach. 

6.1 Size of the Data Matrix 

The approach described above was applied to only one dimension of the input data: restriction and 
prolongation operators are applied to columns of the input matrix M and of the first factor V. Indeed, 
we assumed that each of these columns satisfies some kind of smoothness property. In contrast, we did 
not assume that the columns of M are related to each other in any way, so that no such property holds 
for the rows of M. Therefore we did not apply our multilevel strategy along the second dimension of 
the input data, and our approach only reduced the row dimension m of matrix M at each level from 
m to to' ~ ^r, while the column dimension n remained the same. 

The fact that the row dimension of factor V becomes smaller at deeper levels clearly implies that the 
computational cost associated with updating V will decrease. This reduction is however not directly 
proportional to the reduction from m to m', as this cost also depends on the factorization rank r and 
the dimensions of the other factor, which are not affected. Similarly, although the dimensions of factor 
W remain the same regardless of the depth, its updates could become cheaper because dimension m 
also plays a role there. The relative extent of those effects depends on the NMF algorithm used, and 
will determine in which situations a reduction in the dimension m is clearly beneficial with respect to 
the whole computational cost of the algorithm. 

We now analyze in detail the effect of a reduction of m on the computational cost of one iteration 
of the algorithms presented in Section [2j 

Table 6: Number of floating point operations needed to update V and W in ANLS, MU and HALS. 





ANLS 


MU and HALS 


Update of V 
Update of W 
Both updates 


0(mnr + ms(r)r 3 + nr 2 ) 
0(mnr + ns(r)r 3 + mr 2 ) 
0(m(nr + s(r)r 3 ) + ns(r)r 3 ) 


0(mnr + (m + n)r 2 ) 
0{mnr + (to + n)r 2 ) 
0(m(nr + r 2 ) + nr 2 ) 



(function s(r) is 2 r in the worst case, and typically much smaller, see fSL 

Table O gives the computational cost for the updates of V and W separately, as well as their 
combined cost (see[X]). Our objective is to determine for which dimensions (to, n) of the input matrix 
and for which rank r our multilevel strategy (applied only to the row dimension to) is clearly beneficial 
or, more precisely, find when a constant factor reduction in m, say S = 4, leads to a constant factor 
reduction in the total computational cost of both updates. We make the following observations, 
illustrated on Figure [TUJ 

• We need only consider the region where both to and n are greater than the factorization rank r 
(otherwise the trivial factorization with an identity matrix is optimal). 

• Looking at the last row of the table, we see that all terms appearing in the combined compu- 
tational cost for both updates are proportional to to, except for two terms: ns(r)r 3 for ANLS 
and nr 2 for MU and HALS. If the contributions of those two terms could be neglected compared 
to the total cost, any constant factor reduction in dimension to would lead to an equivalent 
reduction in the total complexity, which is the ideal situation for our multilevel strategy. 
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r 2 s(r) - 




r 2 s(r) 



m (#rows) 



Figure 10: Regions for input dimensions (m, n) where a multilevel strategy is beneficial in all cases 
(m > min{n, r 2 s(r)}, lower and upper right parts) or only for MU and HALS (m < minjn, r 2 s(r)}, 
upper left part). 



When m > n, terms ns(r)r 3 for ANLS and nr 2 for MU and HALS are dominated respectively 
by ms(r)r 3 and mr 2 (i.e., ns(r)r 3 < ms(r)r 3 and nr 2 < mr 2 ), so that they cannot contribute 
more than half of the total computational cost. Therefore a reduction in dimension m will 
guarantee a constant factor reduction in the total complexity. Let us illustrate this on the MU 
(a similar analysis holds for ANLS and HALS) for which the exact total computational cost is 
1m{nr + r 2 ) + 2nr 2 (see|Aj). The factor reduction Jmu m the total complexity satisfies 



1 < fMU 



m(nr + r 2 ) + nr 2 
m'{nr + r 2 ) + nr 2 



< 



m 



4, 



and, for m > n > r and 



fMU > 



4, we have that 

Am'nr + 8m'r 2 



mnr + mr 2 + mr 2 



> 



m'nr + m'r 2 + mr 2 m'nr + bm'r 2 m'r 2 + 5m'r 



Am'r 2 + 8m' r 2 



i.e., the total computational cost of the MU updates on the coarse level is at least twice cheaper 
than on the fine level. Moreover, when m is much larger than n (m n), as is the case for our 
images, the terms in n can be neglected, and we find ourselves in the ideal situation described 
previously (with Jmu ~ 4). In conclusion, when m > n, we always have an appreciable reduction 
in the computational cost. 

Looking now at MU and HALS when m is smaller than n, we see that the term nr 2 is always 
dominated by mnr (i.e., nr 2 < mnr), because m > r always holds. We conclude that a constant 
factor reduction in the total complexity can also be expected when m is reduced. For example, 
for MU, we have 



fMU > 



8m' nr + 4m'r 8 



mnr + mr + mnr 



m'nr + m'r 2 + mnr bm'nr + m'r 2 



> -. 

5 



Considering now ANLS when m is smaller than n, we see that the term ns(r)r 3 is dominated 
by mnr as soon as m > s(r)r 2 . Again, in that situation, a constant factor reduction in the 
total complexity can be obtainecH. Finally, the only situation where the improvement due to 



5 It is worth noting that when m > s(r)r 2 the initial computational cost to formulate the NNLS subproblem in W: 



mm 

w>o 



\\M :i - VW:i\\ 2 F = Y, \\ M --i\\F - '2(M T l V)W :1 + WTi{V T V)W :i 



(6.1) 



which requires the computation of V T V and M T V (cf. O, takes more time than actually solving (|6.1 
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the multilevel technique is modest is when using ANLS when both m < n and m < s(r)r 2 hold, 
in which case the term ns(r)r s can dominate all the others, and a reduction in dimension m is 
not guaranteed to lead to an appreciable reduction in the total complexity. 

To summarize, applying multilevel techniques to the methods presented in this paper is particularly 
beneficial on datasets for which m is sufficiently large compared to n and r (for MU and HALS) and 
to n and s(r)r 2 (for ANLS). Some gains can always be expected for MU and HALS, while ANLS will 
only see a significant improvement if m > min{n, s(r)r 2 } holds. 

In Section [5j we have presented computational experiments for image datasets satisfying this re- 
quirement: the number of images n was much smaller than the number of pixels m in each image. 
In particular, we observed that the acceleration provided by the multilevel approach to the ANLS 
algorithm was not as significant as for HALS: while in most cases ANLS converged faster than HALS 
when using the original NMF algorithms, it converged slower as soon as the multilevel strategy was 
used (see Tables [3] and [5]) . 

To end this section, we note that, in some applications, rows of matrix M can also be restricted 
to lower-dimensional spaces. In these cases, the multilevel method could be made even more effective. 
This is the case for example in the following situations: 

• In hyperspectral images, each column of matrix M represents an image at a given wavelength, 
while each row represents the spectral signature of a pixel, see, e.g., [32j [Tg] . Since spectral 
signatures feature smooth components, the multilevel strategy can be easily generalized to reduce 
the number of rows n of the data matrix M. 

• For a video sequence, each column of matrix M represents an image at a given time so that 
consecutive images share similarities. Moreover, if the camera is fixed, the background of the 
scene is the same among all images. The multilevel approach can then also be generalized to 
reduce the number of columns of M in a meaningful way. 

• In face datasets (e.g., used for face recognition), a person typically appears several times. Hence 
one can imagine using the multilevel strategy by merging different columns corresponding to the 
same person. 

6.2 Convergence 

In classical multigrid methods, when solving a linear system of equations Ax = b, the current approx- 
imate solution x c is not transferred from a fine level to a coarser one, because it would imply the loss 
of its high-frequency components; instead, the residual is transferred, which we briefly explain here. 
Defining the current residual r c = b — Ax c and the error e = x — x c , we have the equivalent defect 
equation Ae = r c and we would like to approximate e with a correction e c in order to improve the 
current solution with x c «— x c + e c . The defect equation is solved approximately on the the coarser 
grid by restricting the residual r c , the correction obtained on the coarser grid is prolongated and the 
new approximation x c + e c is computed, see, e.g., [35, p. 37]. If instead the solution is transferred 
directly from one level to another (as we do in this paper), the corresponding scheme is in general not 
convergent, see |35t p. 156]. In fact, even an exact solution of the system Ax = b is not a fixed point, 
because the restriction of x is not an exact solution anymore at the coarser level (while, in that case, 
the residual r is equal to zero and the correction e will also be equal to zero). 

Therefore, the method presented in this paper should in principle only be used as a pre-processing 
or initialization step before another (convergent) NMF algorithm is applied. In fact, if one already has 
a good approximate solution (V, W) for NMF (e.g., a solution close to a stationary point), transferring 
it to a coarser grid will most likely increase the approximation error because high frequency components 
(such as edges in images) will be lost. Moreover, it seems that the strategy of transferring a residual 
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instead of the whole solution is not directly applicable to NMF. Indeed, a 'local linearization' approach, 
which would consist in linearizing the equation 

M- (V + AV)(W + AW) « <=► R = M -VW ^VAW + AVW, 

where AV and AW are the corrections to be computed on the coarser grids, causes several problems. 
First, handling non-negativity of the coarse versions of the factors becomes non-trivial. Second, 
performing this approximation efficiently also becomes an issue, since for example computing the 
residual R is as expensive as computing directly a full MU or HALS iteration on the fine grid (0(mnr) 
operations). Attempting to fix these drawbacks, which seems to be far from trivial, is a topic for further 
research. 

To conclude this section, we reiterate that, despite these theoretical reservations, it seems our 
technique is still quite efficient (see Section [5]) . One reason that explains that good behavior is that 
NMF solutions are typically part-based and sparse [26| . see Figure [TJ Therefore, columns of matrix 
V contains relatively large 'constant components', made of their zero entries, which are perfectly 
transferred from one level to another, so that sy = — itw]^^ wm typically be very small (in 
general much smaller than 

7 Concluding Remarks 

In this paper, a multilevel approach designed to accelerate NMF algorithms has been proposed and its 
efficiency has been experimentally demonstrated. Applicability of this technique relies on the ability 
to design linear operators preserving nonnegativity and transferring accurately data between different 
levels. To conclude, we give some directions for further research. 

7.1 Extensions 

We have only used our multilevel approach for a specific objective function (sum of squared errors) 
to speed up three NMF algorithms (ANLS, MU and HALS) and to factorize 2D images. However, 
this technique can be easily generalized to different objective functions, other iterative algorithms and 
applied to various kinds of smooth data. In fact, the key characteristic we exploit is the fact that a 
reduction of the dimension(s) of the input matrix (in our numerical examples, m) leads to cheaper 
iterations (on coarse levels) for any reasonable algorithm, i.e., any algorithm whose computational 
cost depends on the dimension of the input matrix (see also the more detailed analysis in Section 16.1)) . 

Moreover, other types of coarse grid definition (e.g., red-black distribution), transfer operators (e.g., 
wavelets transform) and grid cycle (e.g., W-cycle or flexible cycle) can be used and could potentially 
further improve efficiency. 

This idea can also be extended to nonnegative tensor factorization (NTF), see, e.g., [38] and the 
references therein, by using multilevel techniques for higher dimensional spaces. 

7.2 Initialization 

Several judicious initializations for NMF algorithms have been proposed in the literature which allow 
to accelerate convergence and, in general, improve the final solution [12\ 12]. The computational cost 
of these good initial guesses depends on the matrix dimensions and will then be cheaper on a coarser 
grid. Therefore, it would be interesting to combine classical NMF initializations techniques with our 
multilevel approach for further speedups. 

7.3 Unstructured data 

When we do not possess any kind information about the matrix to factorize (and a fortiori about the 
solution), applying a multilevel method seems out of reach. In fact, in these circumstances, there is 



19 



no sensible way to define the transfer operators. 

Nevertheless, we believe it is not hopeless to extend the multilevel idea to other types of data. 
For example, in text mining applications, the term-by-document matrix can be restricted by stacking 
synonyms or similar texts together, see [33] where graph coarsening is used. This implies some a 
priori knowledge or preprocessing of the data and, assuming it is cheap enough, the application of a 
multilevel strategy could be expected to be profitable in that setting. 

Acknowledgments 

We thank Quentin Rentmeesters, Stephen Vavasis and an anonymous reviewer for their insightful 
comments which helped improve the paper. 

References 

[1] M. Berry, M. Browne, A. Langville, P. Pauca, and R. Plemmons, Algorithms and 
Applications for Approximate Nonnegative Matrix Factorization, Computational Statistics and 
Data Analysis, 52 (2007), pp. 155-173. 

[2] C. Boutsidis AND E. Gallopoulos, SVD based initialization: A head start for nonnegative 
matrix factorization, Journal of Pattern Recognition, 41 (2008), pp. 1350-1362. 

[3] J. H. Bramble, Multigrid methods, Number 294 Pitman Research Notes in Mathematic Series. 
Longman Scientific & Technical, UK, 1995. 

[4] A. Brandt, Guide to multigrid development, W. Hackbusch and U. Trottenberg, eds., Multigrid 
Methods, Lecture Notes in Mathematics, Springer, 960 (1982), pp. 220-312. 

[5] W. L. Briggs, A Multigrid Tutorial, SIAM, Philadelphia, 1987. 

[6] D. Chen AND R. Plemmons, Nonnegativity Constraints in Numerical Analysis, in A. Bultheel 
and R. Cools (Eds.), Symposium on the Birth of Numerical Analysis, World Scientific Press., 
2009. 

[7] A. Cichocki, S. Amari, R. Zdunek, and A. Phan, Non-negative Matrix and Tensor Fac- 
torizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation, 
Wiley-Blackwell, 2009. 

[8] A. Cichocki and A.-H. Phan, Fast local algorithms for large scale Nonnegative Matrix and 
Tensor Factorizations, IEICE Transactions on Fundamentals of Electronics, Vol. E92-A No. 3 
(2009), pp. 708-721. 

[9] A. Cichocki, R. Zdunek, and S. Amari, Non-negative Matrix Factorization with Quasi- 
Newton Optimization, in Lecture Notes in Artificial Intelligence, Springer, vol. 4029, 2006, 
pp. 870-879. 

[10] , Hierarchical ALS Algorithms for Nonnegative Matrix and 3D Tensor Factorization, in 

Lecture Notes in Computer Science, Vol. 4666, Springer, pp. 169-176, 2007. 

[11] , Nonnegative Matrix and Tensor Factorization, IEEE Signal Processing Magazine, (2008), 

pp. 142-145. 

[12] J. Curry, A. Dougherty, and S. Wild, Improving non-negative matrix factorizations through 
structured initialization, Journal of Pattern Recognition, 37(11) (2004), pp. 2217-2232. 



20 



M. E. Daube-Witherspoon and G. Muehllehner, An iterative image space reconstruction 
algorithm suitable for volume ect, IEEE Trans. Med. Imaging, 5 (1986), pp. 61-66. 

I. Dhillon, D. Kim, and S. Sra, Fast Newton-type Methods for the Least Squares Nonnegative 
Matrix Approximation problem, in Proc. of SIAM Conf. on Data Mining, 2007. 

N. Gillis, Nonnegative Matrix Factorization: Complexity, Algorithms and Applications, PhD 
thesis, Universite catholique de Louvain, 2011. 

N. Gillis and F. Glineur, Nonnegative Factorization and The Maximum Edge Biclique Prob- 
lem. CORE Discussion paper 2008/64, 2008. 

, Nonnegative Matrix Factorization and Underapproximation. Communication at 9th Inter- 



national Symposium on Iterative Methods in Scientific Computing, Lille, France, 2008. 

N. Gillis and R. Plemmons, Dimensionality Reduction, Classification, and Spectral Mixture 
Analysis using Nonnegative Underapproximation, Optical Engineering, 50, 027001 (2011). 

S. Gratton, A. Sartenaer, and P. Toint, On Recursive Multiscale Trust-Region Algorithms 
for Unconstrained Minimization, in Oberwolfach Reports: Optimization and Applications. 

J. Han, L. Han, M. Neumann, and U. Prasad, On the rate of convergence of the image space 
reconstruction algorithm, Operators and Matrices, 3(1) (2009), pp. 41-58. 

N.-D. Ho, Nonnegative Matrix Factorization - Algorithms and Applications, PhD thesis, Univer- 
site catholique de Louvain, 2008. 

H. Kim and H. Park, Non-negative Matrix Factorization Based on Alternating Non-negativity 
Constrained Least Squares and Active Set Method, SIAM J. Matrix Anal. Appl., 30(2) (2008), 
pp. 713-730. 

J. Kim and H. Park, Toward Faster Nonnegative Matrix Factorization: A New Algorithm and 
Comparisons, in Proc. of IEEE Int. Conf. on Data Mining, 2008, pp. 353-362. 

, Fast nonnegative matrix factorization: An active-set-like method and comparisons, SIAM 

J. on Scientific Computing, (2011). to appear. 

C. Lawson and R. Hanson, Solving Least Squares Problems, Prentice-Hall, 1974. 

D. Lee and H. Seung, Learning the Parts of Objects by Nonnegative Matrix Factorization, 
Nature, 401 (1999), pp. 788-791. 

, Algorithms for Non-negative Matrix Factorization, In Advances in Neural Information Pro- 



cessing, 13 (2001). 

L. Li and Y.-J. Zhang, FastNMF: highly efficient monotonic fixed-point nonnegative matrix 
factorization algorithm with good applicability, J. Electron. Imaging, Vol. 18 (033004) (2009). 

C.-J. Lin, On the Convergence of Multiplicative Update Algorithms for Nonnegative Matrix Fac- 
torization, in IEEE Transactions on Neural Networks, 2007. 

, Projected Gradient Methods for Nonnegative Matrix Factorization, Neural Computation, 

19 (2007), pp. 2756-2779. MIT press. 

P. Paatero and U. Tapper, Positive matrix factorization: a non-negative factor model with 
optimal utilization of error estimates of data values, Environmetrics, 5 (1994), pp. 111-126. 



21 



[32] P. Pauca, J. Piper, and R. Plemmons, Nonnegative matrix factorization for spectral data 
analysis, Linear Algebra and its Applications, 406(1) (2006), pp. 29-47. 

[33] S. Sakellaridi, H.-r. Fang, and Y. Saad, Graph-based Multilevel Dimensionality Reduction 
with Applications to Eigenfaces and Latent Semantic Indexing, in Proc. of the 7th Int. Conf. on 
Machine Learning and Appl., 2008. 

[34] D. Terzopoulos, Image Analysis Using Multigrid Relaxation Methods, J. Math. Phys., PAMI- 
8(2) (1986), pp. 129-139. 

[35] U. Trottenberg, C. Oosterlee, and A. Schuller, Multigrid, Elsevier Academic Press, 
London, 2001. 

[36] M. Van Benthem and M. Keenan, Fast algorithm for the solution of large-scale non-negativity 
constrained least squares problems, J. Chemometrics, 18 (2004), pp. 441-450. 

[37] S. Vavasis, On the complexity of nonnegative matrix factorization, SI AM Journal on Optimiza- 
tion, 20 (2009), pp. 1364-1377. 

[38] Q. Zhang, H. Wang, R. Plemmons, and P. Pauca, Tensor methods for hyperspectral data 
analysis: a space object material identification study, J. Optical Soc. Amer. A, 25(12) (2008), 
pp. 3001-3012. 

A Computational Cost of ANLS, MU and HALS 
A.l MU and HALS 

The main computational cost for updating V in both MU and HALS resides in the computation of 
MW T and! V \WW T ), which requires respectively 2mnr and 2(m + n)r 2 operations, cf. Algorithms [2] 
and [3l Updating W requires the same number of operations, so that the total computational cost is 
0(rnnr + (m + n)r 2 ) operations per iteration, almost proportional to m (only the nr 2 term is not, but 
is negligible compared to the other terms, cf. Subsection 16. ip . see also [151 Section 4.2.1]. 

A.2 Active-Set Methods for NNLS 

In a nutshell, active-set methods for nonnegative least squares work in the following iterative fash- 
ion [251 Algorithm NNLS, p. 161] 

0. Choose the set of active (zero) and passive (nonzero) variables. 

1. Get rid of the nonnegativity constraints and solve the unconstrained least squares problem (LS) 
corresponding to the set of passive (nonzero) variables (the solution is obtained by solving a 
linear system, i.e., the normal equations); 

2. Check the optimality conditions, i.e., the nonnegativity of passive variables, and the nonnega- 
tivity of the gradients of the active variables. If they are not satisfied: 

3. Exchange variables between the set of active and the set of passive variables in such a way that 
the objective function is decreased at each step; and go to 1. 

6 In HALS, VWW T is essentially computed one column at a time, see [151 Section 4.2.1]. 
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In (jNMFp , the problem of computing the optimal V for a given fixed W can be decoupled into m 
independent NNLS subproblems in r variables: 

min 1 1 Mi- - Vi-Wlli, 1 < i < m. 

Each of them amounts to solving a sequence of linear subsystems (with at most r variables, cf. step 1 
above) of 

VrXWW T ) = M l ,W T , 1 < i < m. 

In the worst case, one might have to solve every possible subsystem, which requires 0(g(r)) operations 
witrjll g(r) = YH=i = @(2 r 7" 3 )- Note that WW T and MW T can be computed once for all, which 
requires 0(mnr + nr 2 ) operations (see previous section on MU and HALS). Updating V then requires 
0(mnr + ms(r)r 3 + nr 2 ) operations, while updating W similarly requires Oiranr + ns{r)r ?J + mr 2 ). 
Finally, the total computational cost of one ANLS step is 0(mnr + (m + n)r 2 (rs(r) + 1)) = 0(mnr + 
(m + n)s(r)r 3 ) operations per iteration, where s(r) < 2 r . The number of steps s(r) is @(2 r ) in the 
worst case, but in practice is typically much smaller (as is the case for the simplex method for linear 
programming) . 

When m is reduced by a certain factor (e.g., four as in the multilevel approach presented in 
Section SJ), the computational cost is not exactly reduced by the same factor, because the leading 
(m + n) factor above also depends on n. However, in our applications, when m (number of pixels) is 
much larger than n (number of images), one can roughly consider the cost per iteration to be reduced 
by the same factor, since « ^ (see also the more detailed discussion in Subsection 16. ip . 



7 One can check that (2 (r ~ 3) - l)(r - 2) 3 < g(r) < 2 r r 3 . 
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